 sparse network


Sparse Winning Tickets are Data-Efficient Image Recognizers

Neural Information Processing Systems

Improving the performance of deep networks in data-limited regimes has attracted much attention. In this work, we empirically show that "winning tickets" (small subnetworks) obtained via magnitude pruning based on the lottery ticket hypothesis [1] are, apart from being sparse, also effective recognizers in data-limited regimes. Based on extensive experiments, we find that in low-data regimes (datasets of 50-100 examples per class), sparse winning tickets substantially outperform the original dense networks. This approach, when combined with augmentations or fine-tuning from a self-supervised backbone network, improves performance further, by as much as 16% (absolute) on low-sample datasets and long-tailed classification. Further, sparse winning tickets are more robust to synthetic noise and distribution shifts than their dense counterparts. Our analysis of winning tickets on small datasets indicates that, though sparse, the networks retain density in the initial layers and their representations are more generalizable.
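
As a rough illustration of the magnitude pruning step mentioned in the abstract, the sketch below builds a binary mask that removes the smallest-magnitude weights of a flat weight list. It is not the authors' code: the function names `magnitude_prune_mask` and `apply_mask`, the flat-list representation, and the single pruning round are all assumptions made for brevity. Lottery-ticket procedures usually repeat such a step over several rounds and reset the surviving weights to their initial values, which is omitted here.

```python
def magnitude_prune_mask(weights, sparsity):
    """Return a 0/1 mask that removes the smallest-magnitude fraction of weights.

    Hypothetical helper for illustration; `sparsity` is the fraction to prune.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by increasing absolute value; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out every weight whose mask entry is 0."""
    return [w if m else 0.0 for w, m in zip(weights, mask)]


# Toy example: prune half of six weights.
weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2]
mask = magnitude_prune_mask(weights, sparsity=0.5)
pruned_weights = apply_mask(weights, mask)
```

In this toy case the three weights closest to zero are masked out, and the remaining three keep their values.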

Graph energy as a measure of community detectability in networks

arXiv.org Machine Learning

A key challenge in network science is the detection of communities, which are sets of nodes in a network that are densely connected internally but sparsely connected to the rest of the network. A fundamental result in community detection is the existence of a nontrivial threshold for community detectability on sparse graphs that are generated by the planted partition model (PPM). Below this so-called "detectability limit", no community-detection method can perform better than random chance. Spectral methods for community detection fail before this detectability limit because the eigenvalues corresponding to the eigenvectors that are relevant for community detection can be absorbed by the bulk of the spectrum. One can bypass the detectability problem by using special matrices, like the non-backtracking matrix, but this requires one to consider higher-dimensional matrices. In this paper, we show that the difference in graph energy between a PPM and an Erdős–Rényi (ER) network has a distinct transition at the detectability threshold even for the adjacency matrices of the underlying networks. The graph energy is based on the full spectrum of an adjacency matrix, so our result suggests that standard graph matrices still allow one to separate the parameter regions with detectable and undetectable communities.
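
For reference, the standard definition of graph energy, which the abstract does not spell out, is the sum of the absolute values of the adjacency eigenvalues. The symbols below ($A$, $\lambda_i$, $n$, $\Delta E$) are notation assumed here for illustration and may differ from the paper's.

```latex
% Graph energy of a graph G with n nodes and adjacency matrix A,
% where \lambda_1, \dots, \lambda_n are the eigenvalues of A.
E(G) = \sum_{i=1}^{n} \lvert \lambda_i \rvert

% The quantity compared across the detectability threshold:
% the energy difference between a PPM graph and an ER graph.
\Delta E = E(G_{\mathrm{PPM}}) - E(G_{\mathrm{ER}})
```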


A Theoretical View on Sparsely Activated Networks

Neural Information Processing Systems

Deep and wide neural networks successfully fit very complex functions today, but dense models are starting to be prohibitively expensive for inference. To mitigate this, one promising research direction is networks that activate a sparse subgraph of the network. The subgraph is chosen by a data-dependent routing function, enforcing a fixed mapping of inputs to subnetworks (e.g., the Mixture of Experts (MoE) paradigm in Switch Transformers). However, there is no theoretical grounding for these sparsely activated models. As our first contribution, we present a formal model of data-dependent sparse networks that captures salient aspects of popular architectures.
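
The idea of a data-dependent routing function can be sketched with top-1 routing: a router scores each expert for a given input, and only the highest-scoring expert is evaluated. This is a toy illustration, not the formal model from the paper; the names `route` and `sparse_forward`, the linear router, and the two hand-written experts are assumptions.

```python
def route(x, router_weights):
    """Return the index of the expert with the highest linear router score."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    return max(range(len(scores)), key=lambda i: scores[i])


def sparse_forward(x, router_weights, experts):
    """Evaluate only the expert selected by the router (top-1 routing)."""
    k = route(x, router_weights)
    return k, experts[k](x)


# Two hypothetical experts: one doubles the input, one shifts it by 1.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
]
# One router row per expert; the scores are dot products with the input.
router_weights = [[1.0, 0.0], [0.0, 1.0]]

k_a, y_a = sparse_forward([3.0, 1.0], router_weights, experts)
k_b, y_b = sparse_forward([0.0, 2.0], router_weights, experts)
```

Each input activates exactly one expert, so the cost per input is that of a single subnetwork plus the router, regardless of how many experts exist.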


Channel Permutations for N:M Sparsity

Neural Information Processing Systems

We introduce channel permutations as a method to maximize the accuracy of N:M sparse networks. N:M sparsity requires N out of M consecutive elements to be zero and has been shown to maintain accuracy for many models and tasks with a simple prune and fine-tune workflow. By permuting weight matrices along their channel dimension and adjusting the surrounding layers appropriately, we demonstrate accuracy recovery for even small, parameter-efficient networks, without affecting inference run-time. We also present a quality metric to simplify judging permutations, as well as efficient methods to search for high-quality permutations, including two optimizations to escape local minima. Finally, we share an ablation study showing the importance of each part of our search algorithm, experimental results showing correlation between our quality metric and final network accuracy, improved sparse network accuracy using our techniques with insignificant overhead to training time, and the transformation of unstructured to structured sparse workloads.
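
To make the N:M constraint concrete, the sketch below zeroes the N smallest-magnitude entries in each group of M consecutive elements of a single row, following the convention stated in the abstract, and shows how reordering the elements can change what survives. The helper names `nm_prune` and `kept_magnitude`, the single-row setting, the hand-picked permutation, and the use of total kept magnitude as a score are assumptions for illustration; they are not presented as the paper's metric or search algorithm.

```python
def nm_prune(row, n, m):
    """Zero the n smallest-magnitude entries in each group of m consecutive elements."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    if not 0 <= n <= m:
        raise ValueError("n must be between 0 and m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions within the group, ordered by increasing absolute value.
        order = sorted(range(m), key=lambda i: abs(group[i]))
        drop = set(order[:n])
        out.extend(0.0 if i in drop else v for i, v in enumerate(group))
    return out


def kept_magnitude(row, n, m):
    """Total absolute value that survives N:M pruning (illustrative score)."""
    return sum(abs(v) for v in nm_prune(row, n, m))


# The four largest values sit in the same group, so 2:4 pruning must drop two of them.
row = [9.0, 8.0, 7.0, 6.0, 1.0, 2.0, 0.5, 1.0]
# Hand-picked permutation that spreads the large values across both groups.
perm = [0, 1, 4, 5, 2, 3, 6, 7]
permuted = [row[i] for i in perm]

before = kept_magnitude(row, 2, 4)
after = kept_magnitude(permuted, 2, 4)
```

Here the permuted row keeps more total magnitude than the original ordering. In a real network the same permutation would also have to be applied to the adjacent layers so that the computed function is unchanged, as the abstract notes.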